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Abstract 



We study properties and parameter estimation of finite-state homogeneous continuous-time 
bivariate Markov chains. Only one of the two processes of the bivariate Markov chain is ob- 

["tI ' servable. The general form of the bivariate Markov chain studied here makes no assumptions 

on the structure of the generator of the chain, and hence, neither the underlying process nor 
the observable process is necessarily Markov. The bivariate Markov chain allows for simulta- 

"t^ [ neous jumps of the underlying and observable processes. Furthermore, the inter-arrival time 

of observed events is phase-type. The bivariate Markov chain generalizes the batch Markovian 
arrival process as well as the Markov modulated Markov process. We develop an expectation- 
maximization (EM) procedure for estimating the generator of a bivariate Markov chain, and 
we demonstrate its performance. The procedure does not rely on any numerical integration or 

vQ ' sampling scheme of the continuous-time bivariate Markov chain. The proposed EM algorithm 

T^lj- ' is equally applicable to multivariate Markov chains. 

^^ , Keywords: Parameter estimation, EM algorithm. Continuous-time bivariate Markov chain, 

C^^ ' Markov modulated processes 

1 Introduction 

k> I We consider the problem of estimating the parameter of a continuous-time finite-state homogeneous 

^ ' bivariate Markov chain. Only one of the two processes of the bivariate Markov chain is observable. 

The other is commonly referred to as the underlying process. We do not restrict the structure 
of the generator of the bivariate Markov chain to have any particular form. Thus, simultaneous 
jumps of the observable and underlying processes are possible, and neither of these two processes is 
necessarily Markov. In [25] , a continuous-time bivariate Markov chain was used to model delays and 
congestion in a computer network, and a parameter estimation algorithm was proposed. The model 
was motivated by the desire to capture correlations observed empirically in samples of network 
delays. A continuous-time multivariate Markov chain was used to model ion channel currents in 



*This work was supported in part by the U.S. National Science Foundation under Grant CCF-0916568. Part of the 
work in this paper was presented in a preliminary form at the 45th Conference on Information Science and Systems 
hosted by The Johns Hopkins University, Baltimore, MD, March 23-25, 2011. 



The bivariate Markov chain generaUzes commonly used models such as the batch Markovian ar- 
rival process (BMAP) [5l[13], the Markov modulated Markov process (MMMP) [8], and the Markov 
modulated Poisson process (MMPP) [10 1 120 1 [2 H ll8j. In the BMAP, for example, the generator is an 
infinite upper triangular block Toeplitz matrix. In the MMMP and MMPP, the generator is such 
that no simultaneous jumps of the underlying and observable processes are allowed. In addition, 
the underlying processes in all three examples are homogeneous continuous-time Markov chains. 

We develop an expectation-maximization (EM) algorithm for estimating the parameter of a 
continuous-time bivariate Markov chain. The proposed EM algorithm is equally applicable to mul- 
tivariate Markov chains. An EM algorithm for the MMPP was originally developed by Rvden[2T]. 
Using a similar approach, EM algorithms were subsequently developed for the BMAP in [Sj [13] 
and the MMMP in [8j. The EM algorithm developed in the present paper also relies on Ryden's 
approach. It consists of closed-form, stable recursions employing scaling and Van Loan's approach 
for computation of integrals of matrix exponentials |24j . along the lines of [181 [8]. 

In the parameter estimation algorithm of [25], the continuous-time bivariate Markov chain 
is first sampled and the transition matrix of the resulting discrete-time bivariate Markov chain is 
estimated using a variant of the Baum algorithm [3] . The generator of the continuous-time bivariate 
Markov chain is subsequently obtained from the transition matrix estimate. As discussed in [17j . 
this approach may lead to ambiguous estimates of the generator of the bivariate Markov chain, 
and in some cases it will not lead to a valid estimate. Moreover, the approach does not allow 
structuring of the generator estimate since it is obtained as a byproduct of the transition matrix 
estimate. The EM algorithm developed in this paper estimates the generator of the bivariate 
Markov chain directly from a sample path of the continuous-time observable process. This leads to 
a more accurate computationally efficient estimator which is free of the above drawbacks. 

The remainder of this paper is organized as follows. In Section [21 we discuss properties of the 
continuous-time bivariate Markov chain and develop associated likelihood functions. In Section [S] 
we develop the EM algorithm. In Section [H we discuss the implementation of the EM algorithm 
and provide a numerical example. Concluding remarks are given in Section [5] 

2 Continuous- time Bivariate Marlcov Chain 

Consider a finite-state homogeneous continuous-time bivariate Markov chain 

Z = {X,S) = {iX{t),S{t)), t>0}, (1) 

defined on a standard probability space, and assume that it is irreducible. The process S = 
{S{t),t > 0} is the underlying process with state space of say {oi, . . . , a^}, and X = {X{t),t > 0} 
is the observable process with state space of say {6i,...,6d}. The orders r and d are assumed 
known. We assume without loss of generality that ai = i for i = 1, . . . , r and hi = I for / = 1, . . . , d. 
The state space of Z is then given by {1, . . . , d} x {1, . . . , r}. Neither X nor S need be Markov. 
Necessary and sufficient conditions for either process to be a homogeneous continuous-time Markov 
chain are given in [3l Theorem 3.1]. With probability one, all sample paths of Z are right-continuous 
step functions with a finite number of jumps in any finite interval [H Theorem 2.1]. 
The bivariate Markov chain is parameterized by a generator matrix 

H = {hin{ij), l,n = l,...d;i,j = l,...r}, (2) 
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Figure 1 : Example sample path of Z = (X, 5) . 



where the set of joint states {(/, i)} is ordered lexicographically. With P(-) denoting the probability 
measure on the given space, 



hUij) = lim-P(Z(t + e) = (n,i) | Z{t) = {l,i)), 

e-*-0 e 



(3) 



for {l,i) 7^ {ti,j). The generator matrix can be expressed as a block matrix i? = {Hin, l,n = 
1, . . . ,d}, where Hin = {hin{ij), i,j = 1, . . . ,r} are r x r matrices. The number of independent 
scalar values that constitute the generator H is at most rd{rd — 1). Since none of the rows of H is 
identically zero, the submatrix Hu, / G {1, . . . ,d}, is strictly diagonally dominant, i.e., 



-hii{ii) 



{n,j):{n,j)^{l,i) j:j^i 



(4) 



for alH = 1, . . . , r, and thus, Ha is nonsingular [231 p. 476]. 

Clearly, the observable process X{t) is a deterministic function of the bivariate Markov chain 
Z{t). Conversely, the pair consisting of a univariate Markov chain together with a deterministic 
function of that chain is a bivariate Markov chain (see J19j ) . 



2.1 Density of observable process 

Assume that the observable process X of a bivariate Markov chain Z = {X, S) starts from some 
state Xq at time Tq = and jumps A^ times in [0, T] at < Ti < T2 < • • • < T^r < T. Let 
Xfc = X(Tfc) denote the state of X in the interval [Tfc,Tfc _|_i) for A; = 1, 2, . . . , A^ — 1 and let Xtv 
denote the state of X in the interval [Tn,T]. This convention differs slightly from that used in [8], 
where Xf^ = X{Tf^_i), k = 1, . . . ,N + 1. Define S^ = S{Ti:) to be the state of S at the jump time 
Tfc of X. Let Zk = Z{Tk) = {Xk, Sk)- Let AT^ = T^ — T^-i denote the dwell time of X in state 
Xk_i during the interval [Tfc_i,Tfc), k = 1, . . . , A^. We denote realizations of X^, 5^, Z^, T^, and 
ATfc by Xfc, Sfc, Zk-, tk, and At^, respectively. Figure [U depicts a sample path of a bivariate Markov 
chain Z = {X, S) for which N = 5, r = d = 2, ai = 1, a2 = 2, bi = 1, and 62 = 2. From the figure, 
we see that the sequence {Z^} is given by 



{(1,1),(2,2),(1,2),(2,2),(1,2),(2,1)}. 



(5) 



Note that in Figure [H the processes X and 5 jump simultaneously at times T^ and T5. 

The bivariate Markov chain Z is a pure-jump Markov process and is therefore strong Markov 
(see [26, Section 4-1]) Using the strong Markov property of Z, it follows that 

P{Zk+i = ^, ATfc+i <t\Zq,... ,Zk;TQ, . . . ,Tk) 

= P{Zk+i = z,An+i<T\Zk) (6) 

for all z € {1, . . . ,d} x {1, . . . , r}; r > 0; and k = 0, 1, 2, . . .. Therefore, {(Z^, T^)} is a Markov 
renewal process (see [6j). Since the observable process X can be represented in an equivalent form 
as {(XfcjTfc)}, it follows that the density of X in [0, T/v] may be obtained from the product of 
transition densities of {{Zk,Tk)}. If T > T/v, an additional term is required to obtain the density 
of X in [0,T], as will be specified shortly. 

Assuming a stationary bivariate Markov chain, the transition density of {(Zfc,Tfc)} follows from 
the density corresponding to 

P{Z{t) = {n,j),Ti G [T,T + dT) I Z{0) = {l,i)), (7) 

for / 7^ n, which we denote by /^"(t). Let /'"(r) = {fl^ir), i,j = 1, . . . ,r} denote the transition 
density matrix of {(Z^., ?fc)}. When r does not coincide with a jump time of the observable process, 
the transition probability 

^.(r) = P{S{t) = i, Ti > r I X(0) = /, S{0) = i) (8) 

is also required. Let / (r) = {JIAt), i,j = 1, . . . ,r} denote the corresponding transition matrix. 
The following proposition gives explicit forms for f (t) and / (t). 



Proposition 1. For r > 0, 



and 



/"(r) = e^"^Fz„, l^n, (9) 



f(r) = e^"^ (10) 



Furthermore, 



P{Zk+i = {n,j) I Zk = {l,i)) = [-H^i'Hir^l^. . (11) 

Proof. Following an argument similar to that given in [11''8], the density fl?{t) satisfies the following 
equation: 

/4"(r) = /.,„(zj)e'^"(^^)- + e'^"^-)- f e-hu{^^)t y^ hu{ik)ftj{t)dt. (12) 

Differentiating both sides of (J12p with respect to t and simplifying, it follows that 

^^ = hu{n)fl]{r) + Y. hamft]{T) = J^ humft]{tT). (13) 



Therefore, 

^ = Huf-ir) (14) 

with initial condition /'"(O) = Hin from (J12p . Hence, ([9]) follows. Integrating ([9]) from to oo gives 

dUD- 

The transition probability flAr) satisfies an equation similar to ()12p except that the first term 
on the right-hand side is replaced by e'*"'"^'^(5jj. This change only affects the initial condition, viz., 
/ (0) = /, where / denotes the identity matrix. Hence, (jlOp follows. D 

To simplify notation in the sequel, we shall use P{-) to denote not only a probability measure, but 
also a density, as appropriate (cf. 121^18]). The exact meaning of expressions involving P(-) should be 
clear from the context. In particular, all null probabilities are to be interpreted in the density sense. 
The density of X in [0,T] depends on the initial state probabilities fixoi''') = -P(-'^o = xq^Sq = i), 
i = 1, . . . ,r. Let 

/^xo = {/^xo(l),Mxo(2),...,MxoW} (15) 

denote the initial state distribution. Using the Markov renewal property of {(Zfc,Tfc)}, the density 
of X in [0, T] can be expressed as 



pix{t),o <t<T) = ^,Jfi n-^^^iAtk) \ r^T - t^)i. 



(16) 



where 1 denotes a column vector of all ones. This expression will be used in Section [3] to develop 
the EM recursions. 

2.2 Forward-backward recursions 



The density in ()16p can be evaluated using forward and backward recursions. The forward density 
is defined by the row vector 

Lik) = {PiXit), 0<t<tk,Sk = i), i = l,...,r}, (17) 

for k = 0,1, . . . ,N. The forward recursion is given by 

-^(0) = fixo, 

L{k) = L{k-l)n-^-''iAtk). (18) 

The backward density is defined by the column vector 

Rik) = {P{X{t), tk-i <t<T\ Xk-i = Xk-i,Sk-i = i), i = l,..., r}', 

for A: = A^+1,A^, ...,1, where ' denote matrix transpose. The backward recursion is given by 

R{N + l) = r^{T-tN)l, 

i?(A;) = /^fe-i^fe(Atfc)ii(A; + l). (19) 



From p6]) - p^ . the density of the observable process in [0, T] is given by 

P{X{t),0<t<T) = L{k)R{k + l), k = 0,...,N. (20) 

To ensure numerical stability, it is necessary to scale the above recursions. Using an approach 
similar to that developed in |18j . the scaled forward recursion is given by 



^(0) = f^xo, 

H,)=^^'-^>f"-"'^^'^\k = l,....N, (21) 

Cfc 

where 

Ck^L{k-l)n~^^^{Atk)l, k = l,...,N. (22) 

The scaled backward recursion is given by 

R(N+l) = f"'{T-tK)l, 

jiW = /"-"''^'^'^'^+^),.^l,....A.. (23) 

Cfc 

Clearly, L{0) = L{0) and R{N+1) = R{N + 1). For k = 1, . . . ,N, one can show straightforwardly 
that the scaled and unsealed iterates of the forward and backward recursions are related by 

m = -^^^ and i?(fe) = -#^^ . (24) 

1 lm=l '^m 1 [m=k ^m 

From ()24p and (I17p . one sees that for A: = 1, . . . , A^, the scaled forward vector L{k) can be interpreted 
as the probability distribution of the underlying process S at time T^ conditioned on the observable 
sample path up to and including time t^: 

L{k) = {P{Sk = i I X{t), < t < tk),i = 1, . . . ,r} . (25) 

Thus, the density of X up to and including the A^th jump can be expressed as the product of the 
scaling constants as follows: 

/ TV \ N 

P{X{t),0 <t<tN) = L{N)1 =[llck\ L{N)1 = J] cfc. (26) 

\A:=1 / k=l 

Therefore, the log-likelihood of the observed sample path is given by 

N 

C = Y,logCk. (27) 

fc=i 

The above forward and backward recursions can be generalized to apply to any time t between 
jump times of the observable process. In particular, for t € [ifc,ifc+i)) consider the row vector 

£{t) = {P{S{t) = i I X{t),0 <T<t), i = l,...,r}. (28) 

It follows that 

A similar result was derived by [19j . 
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2.3 Properties 

The observable process X is conditionally Markov given S, and vice versa. In contrast to the 
MMMP and BMAP (see Section I2.4p , the underlying process S of the bivariate Markov chain Z 
need not be Markov. A necessary and sufficient condition for 5 to be a (homogeneous) Markov 
chain is that there exists a matrix Q satisfying [3^ Theorem 3.1] 

d 

Q = Y,Hin (30) 

n=l 

for Z = 1, . . . , d. In this case, Q is the generator of S. 

Various statistics of X and S can be expressed in terms of the stationary distribution of Z. We 
denote this distribution by vr = {vr„j}, where iinj = ^v^t-^ooP{Z{t) = {n,j) \ Z(0) = {l,i)) and the 
set of joint states {{n,j)} is ordered lexicographically. The vector vr is the unique solution to the 
following system: 

ttH = 0, vrl = 1. (31) 

The process {Z^} is a homogeneous discrete-time Markov chain with transition probabilities 
given by (jlip in Proposition[TJ Define the rxr matrix Ai^ = —HjJ Hin, for {/, n : / ^ n G {1, . . . , n}. 
For / = n, let An be a matrix of all zeros. The transition matrix of {Z^} is given by ^ = {A;„}. 
Let Dh = d\a.g{Hii,l = 1, . . . ,d}. It follows that A = —Djj H + I. Let u = {m„j} denote the 
stationary distribution of {Zk}, where f„j = liuik^oo P{Zk = {n,j) \ Zq = {l,i)). The vector z^ is 
the unique solution to the following system: 

uA = u, 1/1 = 1. (32) 

Then z/ can be related to vr as follows (cf. [10, (6)]): 

u = ^^^. (33) 

ttDhI 

The dwell time of the observable process X in a given state has a phase-type distribution 

|161 Chapter 2] . The phase- type distribution generalizes mixtures and convolutions of exponential 

distributions and can be used to approximate a large class of dwell time distributions. To state this 

property formally, we denote the conditional density of the A;th dwell time AT^ given that X^-i = I 

by /ATfe|Xfc_i(T I 0- Define an = limk-,oo P{Sk = i \ X^ = I) and let ai = {aii,i = l,...,r}. The 

conditional probability an can be expressed in terms of z^ as follows: 

au = ^. (34) 

We then have the following proposition, which is proved in El 

Proposition 2. The stationary conditional distribution of AT^ given X^-i = I is phase-type. In 
particular, 

lim /AT,|x,_i(r|0 = a/e^""A, (35) 

where Pi = E„:„^«^inl- 

The phase-type dwell time of the observable process may be useful in explicit durational mod- 
eling embedded in a hidden Markov model system (cf . [9l [22l [28] ) . It should be clear that the dwell 
time of the underlying process S also has a phase-type distribution. This may be useful in certain 
applications for which the underlying process has a non-Markovian character. 
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Figure 2: Relationships among various bivariate Markov chains. 



2.4 Relation to other models 



In this section, we relate the continuous-time bivariate Markov chain to other widely used stochastic 
models mentioned in Section [TJ 

The MMMP (cf . fSj ) is a bivariate Markov chain {X, S) for which the underlying process S 
is a homogeneous irreducible Markov chain. The observable process X, conditioned on S, is a 
nonhomogeneous irreducible Markov chain. The generator of S satisfies Q = "^^=1 ^in, for all I, 
and the submatrices {Hin, I 7^ n} of the bivariate generator matrix H are all diagonal matrices. This 
implies that with probability one, the underlying and observable chains do not jump simultaneously. 
Conditioned on S{t) = i, the generator of the observable process is given by Gj = {hin{ii), l,n = 
1, . . . ,d}. Thus, the MMMP may be parameterized hy {Q,Gi, ... ,Gr}- The number of independent 
scalar values that constitute the parameter of an MMMP is at most r{r — 1) + rd{d — 1). 

The BMAP (cf . [HI [13] ) is a bivariate Markov chain ( A^, S) for which the underlying process S 
is a homogeneous irreducible finite state Markov chain, and the observable process A^ is a counting 
process with state space {0, 1,2,..., ...}. The size of each jump of A^, i.e., the batch size, lies in 
{1,2,... ,d — 1}. The generator of a BMAP is an upper triangular block Toeplitz matrix with 
first row given by {Dq, Di, . . . , Dd-i,0, 0, . . .}. The generator Q of the underlying chain S satisfies 
Q = Ylrn=0'^"i- The BMAP becomes a Markovian arrival process (MAP) when d = 1, i.e., the 
observable process can jump by at most one. For this model, we only have Dq and Di. The MAP in 
turn becomes an MMPP (cf. 110 ^ 120 ^ 121 ] 118]) when Di is a diagonal matrix with the corresponding 
Poisson rates along the main diagonal. In contrast to the MMMP and MMPP, the observable and 
underlying chains of the BMAP and MAP can jump simultaneously. 

The BMAP can be represented in the framework of this paper using a finite-state bivariate 
Markov chain Z = {X,S) where X is defined by X(t) = {N{t) mod d) + 1; i.e., the observable 
process X records the modulo-d counts of the counting process A^ of the BMAP. In this case, the 
generator of Z is given hy H = {Hin, l,n = 1, . . . , d}, where 



H. 



In 



D, 



mod d 



(36) 



and H is a block circulant matrix. The number of independent scalar values that constitute the 
parameter of the BMAP is at most r'^d — r. 

The bivariate Markov chain may also be seen as a hidden Markov process where {Z^} plays the 
role of the underlying Markov chain, and the observations are continuous random variables given by 



{ATfc} |21j . The conditional density of each observation AT^ depends on both Z^-i and Z^, which 
fohows from ([9]). A review of hidden Markov processes may be found in [7j. The EM algorithm 
developed in Section [3] is applicable to any of the above particular cases, as well as multivariate 
Markov chains (see [3]). 

3 EM Algorithm 

In this section, we describe an EM algorithm for ML estimation of the parameter of a bivariate 
chain, denoted by (j)^, given the sample path of the observable process in the interval [0,T]. In the 
EM approach, a new parameter estimate, say (p^-^-i, is obtained from a given parameter estimate, 
say (/>!, as follows: 



'h+i-- 



-.aigmaxE{logP{{Z{t),0 < t < r};(/)) | X{t),0 < t < T;(j),}, (37) 



where the expectation is taken over {5(i),0 <t< T} given the observable sample path {X(t),0 < 
t < T}. The maximization is over (j), which consists of the off-diagonal elements of the bivariate 
generator H. The density of a univariate Markov chain was derived in [1]. A similar approach can 
be used to derive the density of the bivariate Markov chain {Z{t),0 < t < T} which is required in 
(j37p . The resulting log-density is expressed in terms of the number of jumps mf^ from each state 
{l,i) to any other state {n,j) and the dwell time -D' in each state {l,i). 

Let (pii{t) = I{Z{t) = (/,«)), where /(•) denotes the indicator function, and let ^ denote set 
cardinality. Then, 

m'^ = #{t : 0<t<T,Z{t-) = {l,i),Z{t) = {n,j)} 

= J2 'Pli(t-)'Pnj(.t), (38) 

te[o,T] 

D\= j ^Hit)dt, (39) 

Jo 

where the sum in (138p is over the jump points of Z{t). The conditional mean in (I37p involves the 
conditional mean estimates 

mj^ = Eiml] \X{t),0<t<T}, and (40) 

Di = E{Di\X{t),0<t<T}, (41) 

where the dependency on (p^ is suppressed. 

The maximization in ()37p yields the following intuitive estimate in the l + 1st iteration of the 
EM algorithm pj: 

- In 

hin{ij) = '^, il,i)^{n,j). (42) 

Next, we develop closed- form expressions for the estimates m'^ and &■. 



3.1 Number of jumps estimate 



The conditional expectation of m-^ in (j38p is given by 



m. 



%= Y, PiZit-) = {l,i),Zit) = {n,j)\Xir),0<r<T). 

ie[o,T] 



(43) 



To further evaluate this expression, we consider two cases: 1) I = n, i ^ j; and 2) I ^ n. 
Case 1): {I = n, i ^ j) 



In this case, the sum in (j43p is over jumps of the underlying process S from i to j while the 
observable chain X remains in state /. The estimate in (I43p can be written as a Riemann integral 
by partitioning the interval [0, T] into N subintervals of length A such that A^A = T and then 
taking the limit as A approaches zero: 



N 



rn% 



lim y^A 



P{Zi{k - 1)A) = {l,i),Z{kA) = {l,j) I X(i),0 <T<T) 



A-i-O 



fc=l 



PiZ{t-) = (/,i),Z(t) = {l,j) I X(r),0 < T < T)dt. 



(44) 



In dSD, P{Z{t-) = {l,i),Z{t) = {l,j) I X(r),0 < t < T) denotes the conditional density given 
X{t), < r < T, of a jump of Z from (/, i) to {l,j) at time t. A result similar to (j44p was originally 
stated in [H [21 [21] . A detailed proof was provided in [1] , in the context of estimating finite-state 
Markov chains, and in [2], in the context of estimating phase-type distributions. The proof was 
adapted in [8] for estimating MMMPs. 

We have the following proposition, which is stated for the case when T = t^. 



Proposition 3. For k = 0, . . . ,N — 1, define the 2r x 2r matrix 



Ck 



H.,., H,,.^^^R{k + 2)L{k) 







J^X^Xk 



Let Ik be the r x r upper right block of the matrix exponential e '' '=+1, denoted by 
Then, 



fgi-'fe^tfc+i 



12 ■ 



7n, 



u 



HuQ Y. ^' 



K tUj L" V 



Cfc+1 



(45) 



(46) 



(47) 



where denotes element-by- element matrix multiplication. 
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Proof. Let Ij denote a column vector with a one in the ith element and zeros elsewhere. Suppose 
t G [tfc,ifc+i) and x^ = I. Applying fl8]l . (fT9l) . (l26]l in that order we obtain 

P(Z(i-) = {l,i),Z{t) = {l,j) I X(r),0 < T < T) 
_ P(Z(t-) = {l,i),Z{t) = {1,j),X{t),0 <t<T) 



P{X{t),0<t <T) 



P(X(r),0<T<T) 



r'= (t-tfc)i,/i,.,,, (u)i;r'^"'=+^ (tm-t) 



i=k+2 ) 



hiiiij)L{k) 



P{X{t),0<t<T) 
hi{ij)L{k) 



n{t-tk)hi',n'=^+itk+i-t)R{k+2) 



Cfc+l 

hiijij) 

Cfc+l 



rHi-ifc)ia;r'="'^Htfc+i-t)p(fc+2) 



n'"'+'{tk+i-t)R{k+2)L{k)r''{t-tk) 



jt 



Substituting (gSD into (glD, it follows that 



m,- 






hujij) 

Ck+l 



■fc+i 



r'=^'=+i(tfc+i-t)P(A;+2)L(fc)r'^(t-tfc)di 



j« 



Denoting the integral in the above expression by I^ and using Q and (|1U|) . we obtain 

/•At 1^1 
Ik = / e^-^-'^^^'^+'-y^H,,^^^^R{k+2)L{k)e^-k-^ydy. 

Jo 



(48) 



(49) 



(50) 



The result (I47p now follows from (j49p and (j50p . Following the approach of [181 18] , we apply the 
result in |24j to evaluate the integral in ([^U|) and obtain (|1U|) . D 



Case gj : (Z / n) 

In this case, the sum in (j43p is over the jump points of the observable process X from state / 
to state n, irrespective of jumps of S. Hence, the conditional mean of the number of jumps can be 
written as 



m, 



Y, P{Z{tk-) = il,i),Z{tk) = in,j) I X(r),0 <t<T). 



(51) 



k:xt=l, 

We have the following result, which holds for T >t]\j 
Proposition 4. Let 

Jk = R{k+2)L{k)e 



H^k-k^tk^ k = 0,...,N-l. 



(52) 
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Then for I ^ n, 



m. 



Jn 
ij 



HlnQ Yl 



Xk+i=n 



Cfc+l 



u 



Proof. Suppose k is such that x^ = I and xj^+i = n. Similarly to Proposition [3l 

P{Z{tk-) = {l,i),Z{tk) = {n,j) I X(r),0 < r <T) 
P{Z{tk-) = {l,i),Z{tk) = {n,j),XiT),0 <T<T) 



P{X{t),0<t<T) 
hniij) 



N 



Km=k+2 



P{X{t),0<t <T) 
hin{ij) 



L(k)f-''{Atk)lil',R{k+2) 



Cfe+l 

Substituting ([MD into (|5T]) . 



i?(A:+2)L(fc)r'=(Atfc) 



j« 



m. 






Cfc+1 



R{k + 2)L{k)f''H/\tu 



J1- 



(53) 



(54) 



(55) 



The result follows by using ([T0|) in ([55]) and defining J'^ as the bracketed term in that expression. D 

3.2 Dwell time estimate 

Next, we provide an expression for the dwell time estimate l)'. Taking the conditional expectation 
in (|39p . it follows that 



D 



P{Z{t) = {l,i) I X{t),0 <t < T)dt. 



We have the following result, which is stated for the case T = t^- 
Proposition 5. 



where X^. is given in (j46p . 



£> 



T 

E ' 



rZ ,Jj h" 1 



Cfc+1 



(56) 



(57) 
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Proof. The integrand in (j56p can be non-zero only for values of t for which X(t) is in state /. Hence, 
it follows that 

For t G [ifc)^A:+i) and Xfc = /, we have similarly to Proposition [3l 

P{Z{t) = {l,i),X{T),0<T<T) 



P{Z{t) = {l,i) I X{t),0<t<T) 



P{X{t),0<t<T) 



P(^(^)^o<T<r) mt-tk)hii 

L m=k+2 ) 

^^^^ -n{t-tk)ui[r'''''+Ktk+i-t)R{k+2) 



P{X{t),0 <t<T)- 

= -^\n''''+'itk+i-t)Rik+2)Lik)r''it-td . (59) 

The result follows from substituting ([5^ into ([55]) . D 

4 Implementation and Numerical Example 

The EM algorithm for continuous-time bivariate Markov chains developed in Section [3] was imple- 
mented in Python using the SciPy and NumPy libraries. The matrix exponential function from the 
SciPy library is based on a Pade approximation, which has a computational complexity of 0{r'^) for 
an r X r matrix (see [E]). For comparison purposes, the parameter estimation algorithm based on 
time-sampling proposed in [25] was also implemented in Python. We refer to this algorithm as the 
Baum-based algorithm for estimating the parameter of continuous-time bivariate Markov chains. 

4.1 Baum-based Algorithm 

In the Baum-based algorithm described in |25j . the continuous-time bivariate Markov chain Z is 
time-sampled to obtain a discrete-time bivariate Markov chain Z = {X,S) = {Zk = {Xk,Sk)}, 
where 

Zk = Z{kA), k = 0,1,2,..., 

and A is the sampling interval. Let R denote the transition matrix of Z. A variant of the Baum 
algorithm [4] is then employed to obtain a maximum likelihood estimate, R, of R. An estimate of 
the generator of the continuous-time bivariate Markov chain is obtained from 

H=^ln{R), (60) 
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where \n{R) denotes the principal branch of the matrix logarithm of R, which is given by its Taylor 
series expansion 

HR) = f;(-ir-^^^^> (61) 

whenever the series converges. Existence and uniqueness of a generator H corresponding to a 
transition matrix R are not guaranteed (see [121 Hi] ) • In practice, existence and uniqueness of H 
for a given R depend on the sampling interval A (see [IZ]). Moreover, if a generator matrix H of 
a certain structure is desired (e.g., a generator for an MMPP), that structure is difficult to impose 
through estimation of R. 

A sufficient condition for the series in (j6ip to converge is that the diagonal entries of R are all 
greater than 0.5, i.e., R is strictly diagonally dominant (see Theorem 2.2 in [T2] and Theorem 1 in 
[25]). In this case, the row sums of ln{R) are guaranteed to be zero, but some of the off-diagonal 
elements may possibly be negative [12^ Theorem 2.1]. An approximate generator can then be 
obtained by setting the negative off-diagonal entries to zero and adjusting the diagonal elements 
such that the row sums of the modified matrix are zero. If R is not strictly diagonally dominant, 
the algorithm in [25] uses the first term in the series expansion of (j6ip to obtain an approximate 
generator, i.e., 

H = ^. (62) 

4.2 Computational and Storage Requirements 

The computational requirement of the EM algorithm developed in Section [3] depends linearly on 
the number of jumps, N, of the observable process. For each jump of the observable process, 
matrix exponentials for the transition density matrix /^'=^*+i(Atfc) in (jlSp and (|19p and for the 
matrix Z^ in ()46p are computed. Computation of the matrix exponential of an r x r matrix requires 
0{r^) arithmetic operations (see jTS]). Thus, the computational requirement due to computation 
of matrix exponentials is 0{Nr'^). The element-by-element matrix multiplications in (147p and 
(j53p contribute a computational requirement of 0{N(r'^(P)). Therefore, the overall computational 
complexity of the EM algorithm can be stated as 0{N{r^ + r^cf)). The storage requirement of the 
EM algorithm is dominated by the (scaled) forward and backward variables L{k) and R{k). Hence, 
the overall storage required is 0{Nr). 

By comparison, the computational requirement of the Baum-based algorithm is 0{Nr^d?), 
where A'^ = T/A is the number of discrete-time samples. The storage requirement of the Baum- 
based algorithm is 0{Nrd). Clearly, both the computational and storage requirement of this 
algorithm are highly dependent on the choice of the sampling interval A. 

4.3 Numerical Example 

A simple numerical example of estimating the parameter of a continuous-time bivariate Markov 
chain using the EM procedure developed in Section [3| is presented in Tabled! For this example, the 
number of underlying states is r = 2 and the number of observable states is d = 2. The generator 
matrix H is displayed in terms of its block matrix components Hin, which are 2x2 matrices for 
l,n £ {1)2}. The column labeled (j)^ shows the true parameter value for the bivariate Markov 
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/ 


(t>o 


0cm 


Hu 


-70 


10 


-120 


30 


-77.03 


14.76 




20 


-55 


2 


-8 


8.46 


-47.95 


Hu 


50 


10 


70 


20 


51.80 


10.46 




25 


10 


5 


1 


32.66 


6.83 


H21 


50 





70 





49.50 










10 





1 





9.54 


H22 


-60 


10 


-100 


30 


-59.51 


10.01 




20 


-30 


2 


-3 


19.61 


-29.15 



Table 1: 



true; (/>o = initial; 



EM-based estimate. 











0b 


aum 








A 


0.1 


0.01 


0.005 


0.0025 


Hu 


-10.00 
2.35 


0.13 
-6.08 


-77.54 
11.30 


10.40 
-49.21 


-78.18 
6.16 


15.40 
-48.20 


-80.43 
8.73 


15.98 
-48.53 


H12 


0.42 
1.28 


9.44 
2.45 


57.34 
20.63 


9.81 
17.29 


49.29 
33.49 


13.48 
8.55 


52.13 

31.44 


12.32 
8.36 


H21 


0.00 
1.85 


7.92 
1.39 


7.10 
2.19 


2.91 
15.36 


52.04 
2.11 


1.18 
9.74 


51.92 
0.97 


0.45 
10.63 


H22 


-10.00 
0.53 


2.18 
-3.76 


-56.29 
4.21 


6.28 
-21.77 


-64.22 
16.09 


11.01 
-27.93 


-64.51 
17.72 


12.13 
-29.31 



Table 2: Parameter estimates obtained using the Baum-based approach of [25] with different sam- 
pling intervals A. 
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chain. Similarly, the columns labeled (f)o and (pcm show, respectively, the initial parameter and the 
EM-based estimate rounded to two decimal places. The observed data, generated using the true 
parameter (p^, consisted of A^ = 10^ observable jumps. The EM algorithm was terminated when 
the relative difference of successive log-likelihood values, evaluated using (i27j) . fell below 10"'^. 

The bivariate Markov chain parameterized by (p^ in Table [1] is neither a BMAP nor an MMMP. 
Indeed, H is not block circulant as in a BMAP and H12 is not diagonal as in an MMMP. Moreover, 
according to [3, Theorem 3.1], the underlying process S is not a homogeneous continuous-time 
Markov chain since Hu + H12 + H21 + H22 (cf. (|30l)). 

The estimate (/>em was obtained after 63 iterations of the EM procedure. An important property 
of the EM algorithm is that whenever an off-diagonal element of the generator H is zero in the 
initial parameter, the corresponding element in any EM iterate remains zero. This can be seen 
easily from Propositions [3] and [H Thus, if structural information about H is known, that structure 
can be incorporated into the initial parameter estimate and it will be preserved by the EM algorithm 
in subsequent iterations. In the example of Table [H H21 is diagonal in the initial parameter (J)q and 
retains its diagonal structure in the estimate (pem- We also see that the estimate of H21, which has 
the diagonal structure required for an MMMP, is markedly more accurate than that of H12 . Based 
on the numerical experience gained from this and other examples, we can qualitatively say that 
estimation of the diagonal elements of Hi^ {I 7^ n) tends to be more accurate and requires fewer 
iterations than that of the off-diagonal elements. 

For comparison purposes, we have implemented the Baum-based approach proposed in |25j 
and applied it to the bivariate Markov chain specified in Table [T] with true parameter (p^ and 
initial parameter estimate (po, using the sampling intervals A = 0.1, 0.01, 0.005, and 0.0025. The 
corresponding number of discrete-time samples A^ was 2408, 24077, 48153, and 96305, respectively. 
The algorithm was terminated when the relative difference of successive log-likelihood values fell 
below 10^^. The number of iterations required for the four sampling intervals was 499, 446, 105, 
and 123, respectively. In this example, a generator matrix could be obtained from the transition 
matrix estimate using (|60p for all of the sampling intervals except for A = 0.1. When A = 0.1, the 
generator was obtained using the approximation (I62p . 

The results are shown in Table [2j For all of the sampling intervals, the estimate of H21 is not 
a diagonal matrix, but the accuracy of this estimate appears to improve as A is decreased. The 
Baum-based estimates of the other block matrices Hin also appear to become closer in value to the 
EM-based estimate (pem shown in Tabled] as the sampling interval decreases. On the other hand, as 
A decreases, the computational requirement of the Baum-based approach increases proportionally, 
as discussed in Section 14.21 In the case A = 0.1, the parameter estimate is far from the true 
parameter, which is not surprising, as many jumps of the observable process are missed in the 
sampling process. Indeed, the likelihood of the final parameter estimate 0baum obtained in this 
case is actually lower than that of the initial parameter estimate (po given in Table [TJ This example 
illustrates not only the high sensitivity of the final parameter estimate with respect to the size of 
the sampling interval, but also that the likelihood values of the continuous-time bivariate Markov 
chain may decrease from one iteration to the next in the Baum-based approach. In contrast, the 
EM algorithm generates a sequence of parameter estimates with nondecreasing likelihood values. 
Conditions for convergence of the sequence of parameter estimates were given in |27] . 
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5 Conclusion 

We have studied properties of the continuous-time bivariate Markov chain and developed exphcit 
forward-backward recursions for estimating its parameter based on the EM algorithm. The pro- 
posed EM algorithm does not require any sampling scheme or numerical integration. The bivariate 
Markov chain generalizes a large class of stochastic models including the MMMP and the BMAP, 
which both generalize the MMPP but are not equivalent. In its general form, the bivariate Markov 
chain has been used to model ion channel currents (see [3]) and congestion in computer networks 
(see [25] ) . Since the proposed EM procedure preserves the zero values in the estimates of the gen- 
erator for the bivariate Markov chain, it can be applied to estimate the parameter of special cases, 
for example, the MMMP and BMAP, by specifying an initial parameter estimate of the appropriate 
form. 

A Proof of Proposition [2] 

The density corresponding to ([7]) can be expressed as 

/AT„x,A|x,_„5._.(r,n,i | l,i) = [/"(r)].„ k > 1. (63) 

Summing both sides of ([63ll over n, for n 7^ /, and over j, applying ([9]), and using /?; = "^n-njti ^in^, 
we obtain 



/ATfc|Xfc_i,5fc_i('^ I ^>^) 






[^''"^A],. 



(64) 



Applying the law of total probability, 

/AT,|x,_,(r I = Y,PiSk-i = i\Xk-i = I) [e""^l3i]^ ■ 

i 

Taking the limit as /c —t- 00, it follows that 

lim /AT,|x,_i(r|0 = «/e''"*"A- 

Equation ([66j) has the form of a phase-type distribution parameterized by {ai,Hii) [16, Chapter 
2]. Indeed, the stationary distribution of AT^ conditioned on X^-i = I is equivalent to the distri- 
bution of the absorption time of a Markov chain defined on the state space {l,2,...,r + l} with 
initial distribution given by ai and generator matrix given by 



(65) 



(66) 



G 







(67) 



where denotes a row vector of all zeros. Here, r + 1 is an absorbing state, while the remaining 
states, 1, . . . , r, are transient. 
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